It is important in any data science project to define the objective as specifically as possible. Below, we move from the general to the specific; this framing will direct the analysis.
Section 1: Initial Steps
Section 2: Data Cleaning and Preparation
Section 3: EDA of all variables
I have condensed this EDA section, as the detailed EDA and all the graphs were already shown in the previous two submissions.
Section 4: Feature Engineering
Section 5: Autoencoders, and iForest
An autoencoder is a special type of neural network that learns to copy its input values to its output values. It does not require a target variable like the conventional Y, so it is categorized as unsupervised learning.
Indeed, we are not so much interested in the output layer; we are interested in the hidden core layer. If the number of neurons in the hidden layers is less than that of the input layer, the hidden layers are forced to extract the essential information from the input values. This constraint forces the hidden layers to learn the dominant patterns of the data and ignore the "noise". So in an autoencoder model, the hidden layers must have fewer dimensions than the input or output layers. If the number of neurons in the hidden layers is greater than that of the input layer, the network is given too much capacity to learn the data: in the extreme case, it could simply copy the input to the output, noise included, without extracting any essential information.
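To make the bottleneck idea concrete, here is a minimal numerical sketch (not the PyOD `AutoEncoder` used later): a linear, one-neuron autoencoder, where encoding is projection onto a single learned direction, decoding maps the code back to input space, and the reconstruction error serves as the anomaly score. The synthetic data and the power-iteration training loop are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
# Inliers lie near a 1-D line in 2-D space; the last point sits far off it.
X = rng.normal(size=(100, 1)) @ np.array([[2.0, 1.0]])
X += rng.normal(scale=0.05, size=X.shape)
X = np.vstack([X, [[-1.0, 5.0]]])            # an obvious outlier

Xc = X - X.mean(axis=0)
# Power iteration finds the top principal direction -- the one-neuron "bottleneck"
v = rng.normal(size=2)
for _ in range(50):
    v = Xc.T @ (Xc @ v)
    v /= np.linalg.norm(v)

codes = Xc @ v                       # encode: compress each row to one number
recon = np.outer(codes, v)           # decode: map the code back to 2-D
errors = np.linalg.norm(Xc - recon, axis=1)  # reconstruction error per row
```

Points the bottleneck cannot reconstruct well (here, the appended outlier) receive the largest errors, which is exactly how reconstruction-based detectors rank anomalies.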
Isolation Forest, or iForest, is an anomaly detection algorithm created by Fei Tony Liu et al. They argue that most existing approaches to anomaly detection first find the norm, then identify observations that do not conform to it. They propose the Isolation Forest as an alternative approach — explicitly isolating anomalies instead of profiling normal data points. Anomalies are isolated close to the root of a tree, whereas normal points are isolated at its deeper end. They call each tree an Isolation Tree, or iTree. This isolation characteristic of the trees forms the basis for detecting anomalies.
When we model the data with an iTree, outliers usually have short path lengths in the tree. The bulk of the normal observations, on the other hand, requires many tree branches to separate. So the path depth becomes a good proxy for the anomaly score. It is important to note that the iTree algorithm differs from the decision tree algorithm in that it does not use a target variable to train the tree; it is an unsupervised learning method.
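The path-length intuition is formalized in the paper's anomaly score: the expected path length E[h(x)] is normalized by c(n), the average path length of an unsuccessful search in a binary search tree of n points, giving s(x, n) = 2^(−E[h(x)]/c(n)). A small self-contained sketch of just this scoring formula (the tree-building step is omitted):

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + EULER_GAMMA   # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(avg_path_len, n):
    """s near 1 -> likely anomaly (short paths); s near 0.5 -> ordinary point."""
    return 2.0 ** (-avg_path_len / c(n))
```

A point whose average path length equals c(n) scores exactly 0.5, while shorter paths push the score toward 1.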
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import scipy
import time
import seaborn as sns
sns.set(style="whitegrid")
import warnings
warnings.filterwarnings("ignore")
import missingno as msno
from sklearn.impute import SimpleImputer
from sklearn import metrics
from sklearn import preprocessing
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from pprint import pprint
import zipcodes
import plotly
import plotly.express as px
from pyod.models.iforest import IForest
from pyod.models.auto_encoder import AutoEncoder
from pyod.models.combination import aom, moa, average, maximization
from pyod.utils.utility import standardizer
from pyod.utils.data import generate_data, get_outliers_inliers
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize
# Read the data
df = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 6 PyOD/inpatientCharges.csv')
df.head()
# Rows and Columns
print(" ")
print("Number of rows and columns in the dataset:")
print(df.shape)
print(" ")
print("Basic statistics of the numerical columns are as follows:")
# Check basic statistics
df.describe()
# Check the column dtypes
df.info()
Check for categorical/object fields using the data variable descriptions. Convert the relevant numeric fields to their respective categorical/object types:
Provider Id
The Zipcode column is represented as an integer. Convert it to zipcode format.
The variable Hospital Referral Region Description comprises the State and the city, which appears to be the nearest metro city.
Average Covered Charges is not significant for our analysis; it would be for other purposes such as claims fraud, insurance premiums, etc.
The two payments columns need to be converted to proper numeric formats.
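As a sketch of that conversion (the `pay` series below is an illustrative assumption, not the real column): stripping both the currency symbol and any thousands separators before casting is a bit more robust than slicing off the first character.

```python
import pandas as pd

pay = pd.Series(["$32,963.07", "$5,777.24"])   # hypothetical raw values
# Remove '$' and ',' in one pass, then cast to float
cleaned = pay.str.replace(r"[$,]", "", regex=True).astype(float)
```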
# Basic Sort of Provider ID and DRG Definition
df = df.sort_values(['Provider Id', 'DRG Definition'])
df.head(2)
The given column names contain spaces, trailing spaces, etc. I will rename all the columns to a consistent naming convention.
df.columns = ['DRG_Definition', 'Provider_Id', 'Provider_Name',
'Provider_Street_Address', 'Provider_City', 'Provider_State',
'Provider_Zip_Code', 'Hospital_Referral_Region_Description',
'Total_Discharges', 'Average_Covered_Charges',
'Average_Total_Payments', 'Average_Medicare_Payments']
df.columns
df["Average_Total_Payments"] = df["Average_Total_Payments"].str[1:].astype(float)
df["Average_Medicare_Payments"] = df["Average_Medicare_Payments"].str[1:].astype(float)
df["Average_Covered_Charges"] = df["Average_Covered_Charges"].str[1:].astype(float)
df["Provider_Id"] = df["Provider_Id"].astype(object)
df.head(2)
Convert Provider_Zip_Code from integer to the appropriate zipcode format
# Zipcode to 5-character zipcode format
df['Provider_Zip_Code'] = df['Provider_Zip_Code'].astype(str).str.zfill(5)
df.isnull().sum().sum()
There are no NA's, which is good for our analysis.
Merge Region data using the Provider_State variable.
Regions will be a very useful feature when performing the Exploratory Data Analysis.
states = pd.read_csv('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 6 PyOD/states.csv')
states.head(5)
# Left join the new dataset
df = pd.merge(left = df, right = states, left_on = 'Provider_State', right_on = 'State Code', how = 'left')
df.head(2)
# Remove duplicate 'state' column
df = df.drop(columns = ['State', 'State Code'])
Dataset source: https://www.psc.isr.umich.edu/dis/census/Features/tract2zip/
This file has zipcode-wise mean and median income data for 2006 to 2010.
income_df = pd.read_excel('/Users/harshdhanuka/Desktop/Columbia Class Matter/SEM 3/5420 Anomaly Detection/Assignment 6 PyOD/MedianZIP-3.xlsx')
income_df['Zip'] = income_df['Zip'].astype(str).str.zfill(5)
income_df.head(3)
income_df.isnull().sum()
df = pd.merge(left = df, right = income_df, left_on = 'Provider_Zip_Code', right_on = 'Zip', how = 'left')
df = df.drop(columns = ['Zip','Mean'])
df.rename(columns={'Median':'Zip_Median_Income',
'Pop':'Zip_Population'}, inplace=True)
df.head(2)
DRG_Definition Distribution
Explore the total number of DRG Definitions and the count of how many times each appears.
df_count = df['DRG_Definition'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['DRG_Definition','Count']
df_count.head()
fig = px.bar(df_count, x = 'DRG_Definition', y = 'Count', color = 'DRG_Definition',
width=1450, height=500,
title = "Distribution of DRG Definitions")
fig.show()
The DRG Definition counts are reasonably well distributed, ranging from roughly 600 to 3,000.
No DRG Definition count looks like an outlier; all behave normally.
Provider_Name Distribution, see popular Providers
Explore the total number of Provider Names and the count of how many times each one appears.
df_count = df['Provider_Name'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_Name','Count']
df_count.head()
# Show only those Provider_Names whose total count is above 100
df_count1 = df_count.loc[df_count['Count'] > 100]
fig = px.bar(df_count1, x='Provider_Name', y='Count',
width=1200, height=500,
color = 'Provider_Name',
title = "Distribution of Provider Names, only showing for Count > 100")
fig.show()
# Show only those Provider_Names whose total count is below 3
df_count1 = df_count.loc[df_count['Count'] < 3]
fig = px.bar(df_count1, x='Provider_Name', y='Count',
width=1200, height=500,
color = 'Provider_Name',
title = "Distribution of Provider Names, only showing for Count < 3")
# fig.show()
From the above two count charts, it is clear that some Providers are extremely popular, with around 600 entries each. These appear to be providers offering services under multiple DRG Definitions.
Meanwhile, some Providers are very unpopular, with only 1 entry. This depends on the DRG Definition: some hospitals are single-specialty hospitals, so all such patients go there.
Provider_City Distribution, see popular Cities
Explore the total number of Cities and the count of how many times each one appears.
df_count = df['Provider_City'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_City','Count']
df_count.head()
# Show only those Provider_Cities whose total count is above 500
df_count1 = df_count.loc[df_count['Count'] > 500]
fig = px.bar(df_count1, x='Provider_City', y='Count',width=1000, height=500,
color = 'Provider_City',
title = "Distribution of Provider Cities, only showing for Count >500")
# fig.show()
# Show only those Provider_Cities whose total count is below 5
df_count1 = df_count.loc[df_count['Count'] < 5]
fig = px.bar(df_count1, x='Provider_City', y='Count',width=1000, height=500,
color = 'Provider_City',
title = "Distribution of Provider Cities, only showing for Count < 5")
# fig.show()
Chicago is the most popular city with around 1500 entries. There are also a lot of other cities which have only 1 entry.
Provider_State Distribution, see popular States
Explore the total number of States and the count of how many times each one appears.
df_count = df['Provider_State'].value_counts()
df_count = pd.DataFrame(df_count).reset_index()
df_count.columns = ['Provider_State','Count']
df_count.head()
fig = px.bar(df_count, x='Provider_State', y='Count',
width=1000, height=500,
color = 'Provider_State',
title = "Distribution of Provider State")
# fig.show()
The states seem to have a good distribution. There appear to be no outliers or states requiring special attention.
Average_Total_Payments Distribution
fig = px.histogram(df, x="Average_Total_Payments",
width=1000, height=500,
title = "Distribution of Average Total Payments")
# fig.show()
fig = px.box(df, x="Average_Total_Payments",width=1000, height=500,
title = "Distribution of Average Total Payments")
fig.show()
fig = px.violin(df, y="Average_Total_Payments", box=True,
points='all',width=800, height=500,
title = "Distribution of Average Total Payments")
# fig.show()
Most of the average payments are less than USD 11,000, so any average payment above that might warrant further investigation.
There are some extremely high values, above USD 150,000, which may need investigation.
Average_Medicare_Payments Distribution
fig = px.histogram(df, x="Average_Medicare_Payments",
width=1000, height=500,
title = "Distribution of Average Medicare Payments")
# fig.show()
fig = px.box(df, x="Average_Medicare_Payments",width=1000, height=500,
title = "Distribution of Average Medicare Payments")
# fig.show()
Most of the average Medicare payments are less than USD 10,000, so any average payment above that might warrant further investigation.
There are some extremely high values, above USD 150,000, which may need investigation.
Total_Discharges Distribution
fig = px.histogram(df, x="Total_Discharges",
width=800, height=500,
title = "Distribution of Total Discharges")
# fig.show()
fig = px.box(df, x="Total_Discharges",width=1000, height=500,
title = "Distribution of Total Discharges")
# fig.show()
Most of the Total Discharges are less than 49.
There are some extremely high values, such as 3383, which may need investigation.
Average_Total_Payments by DRG_Definition
fig = px.box(df, x="DRG_Definition", y="Average_Total_Payments",width=1400, height=500,
color = "DRG_Definition",
title = "The Distribution of Average Total Payments by DRG Definition")
fig.show()
Some DRGs have very high Average Total Payments; these may be critical operations, which carry high costs.
Average_Total_Payments by State
fig = px.box(df, x="Provider_State", y="Average_Total_Payments",width=1000, height=500,
color = "Provider_State",
title = "The Distribution of Average Total Payments by Provider State")
# fig.show()
The Average Total Payments are more or less similar across states, but some, such as NY and CA, are very expensive overall.
Average_Total_Payments by Region
fig = px.box(df, x="Region", y="Average_Total_Payments",width=1000, height=500,
color = "Region",
title = "The Distribution of Average Total Payments by Region")
# fig.show()
# px.violin(df,x='Average_Total_Payments', y = "Region", color='Region',
# title = "The Distribution of Average Total Payments by Region",
# orientation='h').update_traces(side='positive',width=2)
The West region is generally high in terms of Average Total Payments; this is consistent with the earlier observation that CA was extremely high.
It is followed by the Northeast, which includes NY.
Total_Discharges by DRG_Definition
fig = px.box(df, x="DRG_Definition", y="Total_Discharges",width=1400, height=500,
color = "DRG_Definition",
title = "The Distribution of Total Discharges by DRG Definition")
# fig.show()
The discharge count for some DRGs is very high, while most others have a balanced discharge count.
Total_Discharges by Region
fig = px.box(df, x="Region", y="Total_Discharges",width=1000, height=500,
color = "Region",
title = "The Distribution of Total Discharges by Region")
# fig.show()
# px.violin(df,x='Total_Discharges', y = "Region", color='Region',
# title = "The Distribution of Total Discharges by Region",
# orientation='h').update_traces(side='positive',width=2)
Most regions have a similar total-discharge pattern. However, the Northeast region has an outlier.
Average_Total_Payments and Average_Medicare_Payments
fig = px.scatter(df, x="Average_Total_Payments", y="Average_Medicare_Payments",
size = "Average_Total_Payments", color = 'Average_Total_Payments',
size_max=60,width=800, height=600)
# fig.show()
As the average total payments increase, the average Medicare payments also increase, indicating very high collinearity between these two variables.
agg_columns = ['mean', 'median', 'var', 'std', 'count', 'min', 'max']
groupby_drg = df[['DRG_Definition', 'Average_Total_Payments']].groupby(by='DRG_Definition').agg(agg_columns)
# Flatten the MultiIndex columns to 'Average_Total_Payments-<stat>'
groupby_drg.columns = [header + '-' + agg_column
                       for header, agg_column in zip(groupby_drg.columns.get_level_values(0), agg_columns)]
groupby_drg.reset_index(inplace=True)
groupby_drg['Average_Total_Payments-range'] = groupby_drg['Average_Total_Payments-max'] - groupby_drg['Average_Total_Payments-min']
groupby_drg.head(2)
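An alternative sketch of the same multi-statistic groupby using pandas named aggregation, which yields flat column names directly and skips the MultiIndex-flattening step (the toy frame below is illustrative, not the real data):

```python
import pandas as pd

toy = pd.DataFrame({
    "DRG_Definition": ["A", "A", "B"],
    "Average_Total_Payments": [100.0, 300.0, 50.0],
})
# Named aggregation: each keyword becomes a flat output column
g = toy.groupby("DRG_Definition").agg(
    pay_mean=("Average_Total_Payments", "mean"),
    pay_min=("Average_Total_Payments", "min"),
    pay_max=("Average_Total_Payments", "max"),
).reset_index()
g["pay_range"] = g["pay_max"] - g["pay_min"]
```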
def plt_setup(_plt):
    _plt.tick_params(
        axis='x',          # changes apply to the x-axis
        which='both',      # both major and minor ticks are affected
        bottom=False,      # ticks along the bottom edge are off
        top=False,         # ticks along the top edge are off
        labelbottom=False)
plt.figure(figsize=(20,8))
# sns.barplot(x='DRG_Definition', y='Average_Total_Payments-mean',
# data=groupby_drg.sort_values('Average_Total_Payments-mean'))
plt_setup(plt)
plt.title('Mean Average Total Payments by DRG', fontsize=16)
plt.ylabel('Mean of Average Total Payments', fontsize=16)
Some DRG groups have a very high mean, which implies that some DRG groups generally charge a very high amount for treatment in terms of 'Total Payments'.
pyt_by_drg = df.groupby('DRG_Definition').sum(numeric_only=True).reset_index()
pyt_by_drg = pyt_by_drg.sort_values('Average_Total_Payments', ascending=False)
# pyt_by_drg.head()
# Extract only rows with amount > 40,000,000
pyt_by_drg = pyt_by_drg.loc[pyt_by_drg['Average_Total_Payments'] > 40000000]
# plt.figure(figsize=(20,4))
# fig = sns.barplot(x='DRG_Definition', y='Average_Total_Payments',
# data=pyt_by_drg)
# fig.set_xticklabels(fig.get_xticklabels(), rotation=15)
# plt.title('Mean Average Total Payments by DRG, for total > 40,000,000', fontsize=16)
# plt.ylabel('Mean of Average Total Payments', fontsize=16)
DRG 329 is the highest in terms of the total sum of Average Total Payments.
unique_ids = len(df.groupby('Provider_Id').count())
unique_providers = len(df.groupby('Provider_Name').count())
unique_cities = len(df.groupby('Provider_City').count())
unique_states = len(df.groupby('Provider_State').count())
print(" ")
print(f'There are {unique_ids} unique provider id values in the data, and {unique_providers} unique provider names in a total of {unique_cities} unique cities, and {unique_states} states.')
print(" ")
fig = sns.pairplot(df[['Region', 'Total_Discharges', 'Average_Total_Payments','Average_Medicare_Payments']],
hue= 'Region')
fig
corr = df[['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments']].corr()
f,ax = plt.subplots(figsize=(7,5))
sns.heatmap(corr, annot=True, cmap='Reds', linewidths=.4, fmt= '.1f',ax=ax)
# plt.show()
From the above graphs, some variables are highly correlated, such as Average Total Payments and Average Medicare Payments. Average Total Payments has a long-tailed distribution, which could indicate potential fraud.
From the correlation matrix: total payments are correlated with Medicare payments.
We can conclude that those variables are indeed related; for modeling purposes, it makes more sense to include only one or two of the three variables.
plt.figure(figsize=(20,20))
g = sns.PairGrid(df,
x_vars = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments'],
y_vars = ['Provider_State'],
height=10, aspect=.25)
# Draw plot
g.map(sns.stripplot, size=10, orient="h",
palette="ch:s=1,r=-.1,h=1_r", linewidth=1.10, edgecolor="w")
# Use the same x axis limits on all columns and add better labels
g.set(xlabel="", ylabel="")
titles = ["Total Discharges", "Average Total Payments",
"Average Medicare Payments"]
for ax, title in zip(g.axes.flat, titles):
# Set a different title for each axes
ax.set(title = title)
# Make the grid horizontal instead of vertical
ax.xaxis.grid(False)
ax.yaxis.grid(True)
# plt.tight_layout()
# plt.show()
#plt.figure(figsize=(20,20))
#g = sns.PairGrid(df,
# x_vars = ['Total_Discharges', 'Average_Total_Payments', 'Average_Medicare_Payments'],
# y_vars = ['Region'],
# height=10, aspect=.25)
# Draw plot
#g.map(sns.stripplot, size=10, orient="h",
# palette="ch:s=1,r=-.1,h=1_r", linewidth=1.10, edgecolor="w")
# Use the same x axis limits on all columns and add better labels
#g.set(xlabel="", ylabel="")
#titles = ["Total Discharges", "Average Total Payments",
# "Average Medicare Paymens"]
#for ax, title in zip(g.axes.flat, titles):
# Set a different title for each axes
# ax.set(title = title)
# Make the grid horizontal instead of vertical
# ax.xaxis.grid(False)
# ax.yaxis.grid(True)
# plt.tight_layout()
# plt.show()
Feature 1
By understanding the top DRGs per state, we establish a baseline for what people in each state normally get treated for. This information is useful under the assumption that fraudulent users would try to get treated for the same conditions as the general population.
drg_by_state = df.groupby(['Provider_State', 'DRG_Definition']).agg({'DRG_Definition': 'count'})
drg_by_state.head()
drg_by_state.tail()
I am not merging this with the original dataset, as it is a new reference table used to check the most common DRGs by state.
Feature 2
Number of providers per city.
providers_per_city = df.groupby(['Provider_City']).agg({'Provider_Name':'count'})
providers_per_city.head()
fig = plt.figure(figsize=(10,7))
providers_per_city.sort_values(by = 'Provider_Name', ascending = False).head()
sns.distplot(providers_per_city)
plt.tight_layout()
I am not merging this with the original dataset as this is a new data table created to be used as a reference to check the most common hospitals.
df['Average_Cost_Per_Procedure'] = df['Average_Total_Payments']/df['Total_Discharges']
df.head(2)
fig = plt.figure(figsize=(15,7))
plt.boxplot(df['Average_Cost_Per_Procedure'], vert=False)
plt.title('Average Cost Per Procedure')
plt.xlabel("Cost in $")
plt.tight_layout()
df['Medicare_%_Paid'] = df['Average_Medicare_Payments']/df['Average_Total_Payments']
df.head(2)
fig = plt.figure(figsize=(15,7))
sns.boxplot(df['Medicare_%_Paid'])
plt.title('Percentage of Average Total Payments Paid by Medicare (Average)')
plt.tight_layout()
Most average medicare payments are around 80-90% of the average total payments.
medicare_pct_state = df.groupby('Provider_State').agg({'Medicare_%_Paid': 'mean'}).reset_index()
medicare_pct_state.head(5)
df = pd.merge(left = df, right = medicare_pct_state, left_on = 'Provider_State', right_on = 'Provider_State',
how = 'left')
df.rename(columns = {'Medicare_%_Paid_x':'Medicare_%_Paid',
'Medicare_%_Paid_y':'Medicare_%_Paid_State'}, inplace=True)
df.head(2)
fig = px.scatter(df, x="Provider_State", y="Medicare_%_Paid_State", width=1000, height=500,
color = "Provider_State",
title = "Medicare % Paid Distribution (by State)")
fig.show()
#fig = plt.figure(figsize=(15,7))
#sns.distplot(df['Medicare_%_Paid_State'])
#plt.title('Medicare % Paid Distribution (by State)')
#plt.tight_layout()
Feature 6
Out-of-pocket payments highlight the procedures/treatments that are most expensive for patients. The hypothesis is that the procedures with the highest out-of-pocket cost are the least likely to be a target for fraud.
df['Out_of_Pocket_Payment'] = df['Average_Total_Payments'] - df['Average_Medicare_Payments']
df.head(2)
sorted_avg_out_of_pocket = df.groupby(['DRG_Definition']).agg({'Out_of_Pocket_Payment': 'mean'})
sorted_avg_out_of_pocket.sort_values(by = 'Out_of_Pocket_Payment',ascending=False).head()
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Out_of_Pocket_Payment'])
plt.title('Out of Pocket Payment Distribution')
plt.tight_layout()
df['Out_of_Pocket_per_discharge'] = df['Out_of_Pocket_Payment']/df['Total_Discharges']
df.head(2)
fig = plt.figure(figsize=(15,7))
sns.distplot(df['Out_of_Pocket_per_discharge'])
plt.tight_layout()
patients_states = df['Provider_State'].value_counts()
patients_states = pd.DataFrame(patients_states).reset_index()
patients_states.columns = ['Provider_State','Count']
patients_states.head()
df = pd.merge(left = df, right = patients_states, left_on = 'Provider_State', right_on = 'Provider_State',
how = 'left')
df.rename(columns = {'Count':'State_Total'}, inplace=True)
df.head(2)
fig = px.scatter(df, x="Provider_State", y="State_Total", width=1000, height=500,
color = "Provider_State",
title = "Total Procedures/Treatments per state")
fig.show()
#fig = plt.figure(figsize=(15,7))
#sns.distplot(df['State_Total'])
#plt.tight_layout()
patient_avg_state = df.groupby('Provider_State')[['Total_Discharges',
                                                  'Average_Total_Payments',
                                                  'Average_Medicare_Payments']].mean().reset_index()
patient_avg_state.head()
patient_avg_state.loc[:,'Total_Discharges':'Average_Medicare_Payments'].corr()
fig = plt.figure(figsize=(15,10))
plt.subplot(2, 2, 1)
plt.boxplot(patient_avg_state['Total_Discharges'])
plt.title('Total Discharge Box plot')
plt.xlabel('')
plt.subplot(2, 2, 3)
plt.boxplot(patient_avg_state['Average_Total_Payments'])
plt.title('Average Total Payment Boxplot')
plt.xlabel('')
plt.subplot(2, 2, 4)
plt.boxplot(patient_avg_state['Average_Medicare_Payments'])
plt.title('Average Medicare Payment Boxplot')
plt.tight_layout()
plt.show()
# States with highest discharges
patient_avg_state.sort_values(by = 'Total_Discharges', ascending = False).head()
# States with highest Average Total Payments
patient_avg_state.sort_values(by = 'Average_Total_Payments', ascending = False).head()
These new mean columns are useful, but I believe median columns will be handier. So, I will not add the mean columns to the original dataset yet.
Feature 10
To catch payments for treatments that exceed a normal amount, I will first create a feature that calculates a score for each data row. The score will indicate the size of the payment amount for a particular treatment relative to the median payment amount for that DRG code at the state level.
median_drg_state = df.groupby(['DRG_Definition','Provider_State'])['Average_Total_Payments'].median().reset_index()
median_drg_state.head()
df = pd.merge(left = df, right = median_drg_state, left_on = ['DRG_Definition','Provider_State'], right_on = ['DRG_Definition','Provider_State'], how = 'left')
df.rename(columns={'Average_Total_Payments_y':'Median_Avg_Total_Pyt',
'Average_Total_Payments_x':'Average_Total_Payments'}, inplace=True)
df.head(2)
# Check for one particular state and one particular DRG, to see median
df[(df['Provider_State'] == 'NV') & (df['DRG_Definition'] == '194 - SIMPLE PNEUMONIA & PLEURISY W CC')].head(2)
Feature 11
I will now take the average total payment and divide it by the median payment to generate a simple score indicating how many times larger the current payment amount is than the median amount.
df['Median_Score'] = df['Average_Total_Payments']/df['Median_Avg_Total_Pyt']
df.head(2)
# fig = plt.figure(figsize=(10,5))
# sns.distplot(df['Median_Score'])
# plt.tight_layout()
fig = plt.figure(figsize=(15,5))
sns.boxplot(df['Median_Score'])
plt.tight_layout()
df['Median_Score'].describe()
It would appear that most treatment payment amounts for the same DRG within the same state are within 90% to 110% of the median price. This is expected, as doctors should normally charge similar prices for similar treatments in similar areas. However, we see instances where the payment amount is many times the median.
As we see in the box plot above, in two specific cases, treatment cost over 9 times the median amount.
Let us investigate further.
df[df['Median_Score'] >= 6]
The highest median score is 9.33, which is for:
This individual was charged USD 41,482 for a treatment that had a median amount of just USD 4,441.92. This particular treatment was performed at Provider - STURDY MEMORIAL HOSPITAL.
suspect_hospital1 = df[df['Provider_Name'] == 'STURDY MEMORIAL HOSPITAL']['Median_Score']
fig = plt.figure(figsize=(14,5))
sns.distplot(suspect_hospital1)
plt.tight_layout()
print(" ")
print("Median Score distribution for Provider - STURDY MEMORIAL HOSPITAL is as follows")
print(" ")
print(suspect_hospital1.describe())
The graphical representation is strange. The hospital typically charges less than the median amount for its treatments, as the largest number of observations fall below a median score of 1. This makes the treatment with a median multiple of over 9 highly unusual.
Let's examine the next suspect hospital's track record too.
suspect_hospital2 = df[df['Provider_Name'] == 'ST JOSEPH MEDICAL CENTER']['Median_Score']
fig = plt.figure(figsize=(14,5))
sns.distplot(suspect_hospital2)
plt.tight_layout()
print(" ")
print("Median Score distribution for Provider - ST JOSEPH MEDICAL CENTER is as follows")
print(" ")
print(suspect_hospital2.describe())
Again, this is a hospital that typically charges reasonable prices for treatment; most observations are below a median score of 2, which is fine. But this makes the one treatment at over 9 times the median price an extreme outlier deserving of extra attention.
Making the broad assumption that around 1% of medical payments are fraudulent, I will tag any medical treatment that paid more than the top 99th percentile of median scores in the dataset.
np.percentile(df['Median_Score'], 99)
1% of medical treatments cost more than 1.79 times the state median payment for that treatment. Treatments that paid more than this shall be flagged.
Feature 12
Flag each row whose median score exceeds the 99th percentile of median scores in the dataset.
df['Median_Score_Flag'] = df['Median_Score'] >= np.percentile(df['Median_Score'], 99)
df.head(2)
This approach is good for finding providers that overcharge substantially. However, it is not so useful for finding hospitals that overcharge slightly but over the course of many treatments. One way to find these providers is to see which ones have the highest average median score.
Feature 13
Average median score per provider, to surface providers that consistently charge above the state median across their treatments.
Median_Score_by_Provider = df.groupby(['Provider_Name'])['Median_Score'].mean().reset_index()
print(Median_Score_by_Provider.head(2))
fig = plt.figure(figsize=(10,5))
sns.distplot(Median_Score_by_Provider['Median_Score'])
plt.tight_layout()
print(" ")
print("Median Score distribution by Provider is as follows")
print(" ")
print(Median_Score_by_Provider.describe())
Some providers consistently charge multiple times the median payment amount. However, before we blow the horn on these hospitals, we have to consider that they may be charging more than their in-state counterparts because they are high-end/luxury hospitals and/or located in expensive cities. Let's see which hospitals these are.
df = pd.merge(left = df, right = Median_Score_by_Provider, left_on = 'Provider_Name',
right_on = 'Provider_Name', how = 'left')
df.rename(columns={'Median_Score_x':'Median_Score',
'Median_Score_y':'Median_Score_by_Provider'}, inplace=True)
df.head(2)
I had expected these hospitals to either be luxury hospitals or located in expensive cities.
df[df['Median_Score_by_Provider'] >= 2][['Provider_Name', 'Provider_State','Provider_City']].drop_duplicates()
Feature 14
I set the benchmark at a median score of 2 or higher. This avoids situations where the cost of living in a city inflates the cost of treatment, since even in the most expensive cities, the average cost of treatment will never be double the average cost outside the city within the same state.
df['Provider_Flag_by_Median_Score'] = df['Median_Score_by_Provider'] >= 2
df.head(2)
df[df['Provider_Flag_by_Median_Score']][['Provider_Id','Provider_Name','Provider_City',
'Provider_State','Median_Score_by_Provider']].drop_duplicates()
I will now pick the hospital with the highest score, which is CANCER TREATMENT CENTERS OF AMERICA.
Now, I assume this hospital consistently overcharges for procedures that are much cheaper at other hospitals within the same state. Let's see below.
df[df['Provider_Name'] == 'MEMORIAL HOSPITAL LOS BANOS'][['Median_Score']]
As we can see, this is an effective method to track down hospitals that overcharge or commit fraud.
Feature 15
I will check the grand total of the number of discharges made by each Provider. The hypothesis is that hospitals with the highest number of discharges are more susceptible to fraud, e.g. false patients and fake claims.
discharges_provider = df.groupby(['Provider_Id','Provider_Name'])['Total_Discharges'].sum().reset_index(name='Grand Total of Discharges')
discharges_provider = discharges_provider.sort_values(by='Grand Total of Discharges', ascending=False)
discharges_provider.head()
The highest discharge total is by 'Florida Hospital', at 25,828.
However, it would be ideal to check total discharges grouped by Provider State for better inference.
discharges_provider_state = df.groupby(['Provider_State','Provider_Id', 'Provider_Name'])['Total_Discharges'].sum().reset_index(name='Grand Total of Discharges')
discharges_provider_state = discharges_provider_state.sort_values(by='Grand Total of Discharges', ascending=False)
discharges_provider_state.head()
Feature 16
This ratio gives an idea of whether the average payments are higher or lower than the median income of the population. The hypothesis is that if the ratio is high, then people in that zip code are paying much more than their median income for treatment, which might indicate fraud, as the patient/hospital might have inflated bills.
Even the same hospital may have a lower ratio for one DRG or state and a higher ratio for another, so I will not group the data; I need an individual ratio for each row.
df['Avg_Payment_by_Median_Income'] = df['Average_Total_Payments'] / df['Zip_Median_Income']
df.head(2)
fig, ax = plt.subplots(figsize= (20,5))
sns.boxplot(df["Avg_Payment_by_Median_Income"])
plt.title("Distribution of Average Total Payments by Zipcode Median Income", fontsize=20)
The ones which are over 5 might need some investigation.
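Those rows can be pulled out with a simple boolean filter; a minimal sketch on toy data (the toy frame and provider names are hypothetical stand-ins for `df`, and the threshold of 5 is the same eyeballed cut as above):

```python
import pandas as pd

# Toy frame standing in for df; the real analysis uses the full dataset.
df_demo = pd.DataFrame({
    "Provider_Name": ["A", "B", "C"],
    "Avg_Payment_by_Median_Income": [0.8, 5.6, 12.3],
})

# Keep rows whose payment-to-income ratio exceeds the cut point of 5
suspect = df_demo[df_demo["Avg_Payment_by_Median_Income"] > 5]
print(suspect["Provider_Name"].tolist())  # → ['B', 'C']
```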
Feature 17: Hospitals with a very high discharges-to-population ratio are likely to be showing false patients.
Those with a higher ratio might need further investigation. Even the same hospital may have a lower ratio for a particular DRG or State, but a higher one for another, so I will not group the data. I need an individual ratio for each row entry.
df['Total_Disc_by_Pop'] = df['Total_Discharges'] / df['Zip_Population']
df.head(2)
fig, ax = plt.subplots(figsize= (20,5))
sns.boxplot(df["Total_Disc_by_Pop"])
plt.title("Distribution of Total Discharges by Zipcode Population", fontsize=20)
Ratios over 10 might need investigation.
I will create a new copy of the data to perform the analysis, and call it 'df1'.
# Create a copy of the original dataset
df1 = df.copy()
# Check nulls
df1.isnull().sum().sum()
These NAs are created as a result of merging the Income and the Population data by zipcode. There are certain zip codes in our data set which do not have an entry in the zipcode dataset.
So, I will replace the NAs with the median of the respective column.
df1 = df1.fillna(df1.median(numeric_only=True))  # numeric_only avoids errors on the categorical columns
df1.isnull().sum().sum()
There are a lot of columns which will not be useful in the clustering analysis, most of which are categorical columns. I will drop them.
df1 = df1.drop(columns = ['Provider_Id', 'Provider_Name', 'Provider_Street_Address',
'Provider_City', 'Provider_Zip_Code',
'Hospital_Referral_Region_Description',
'Region', 'Total_Discharges',
'Division', 'State_Total', 'Zip_Median_Income',
'Zip_Population', 'Median_Score_Flag',
'Provider_Flag_by_Median_Score'])
df1.head(2)
Label encoding would be misleading for the clustering, and one-hot encoding creates many new binary features, which is not ideal for kNN- and PCA-based methods.
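The dimensionality blow-up from one-hot encoding is easy to see on a toy sample; a sketch with hypothetical DRG codes and states (the real columns hold far more levels, so the expansion would be far larger):

```python
import pandas as pd

# Toy stand-in: a handful of DRG codes and states. The real columns have
# hundreds of DRG definitions and ~50 states, so one-hot encoding would
# add hundreds of sparse binary columns.
toy = pd.DataFrame({
    "DRG_Definition": ["039", "057", "064", "039", "101"],
    "Provider_State": ["AL", "CA", "CA", "TX", "NY"],
})

encoded = pd.get_dummies(toy, columns=["DRG_Definition", "Provider_State"])
print(encoded.shape[1])  # 4 DRG dummies + 4 state dummies = 8 columns from 2
```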
df1 = df1.drop(columns = ['DRG_Definition', 'Provider_State'])
# All numeric / float variables in the dataset
num_variables_all = ['Average_Covered_Charges', 'Average_Total_Payments', 'Average_Medicare_Payments',
'Average_Cost_Per_Procedure', 'Medicare_%_Paid',
'Medicare_%_Paid_State', 'Out_of_Pocket_Payment',
'Out_of_Pocket_per_discharge', 'Median_Avg_Total_Pyt',
'Median_Score', 'Median_Score_by_Provider',
'Avg_Payment_by_Median_Income',
'Total_Disc_by_Pop']
Drop one column from each pair with a correlation above 0.7
corr = df1.corr()
f,ax = plt.subplots(figsize=(15,10))
sns.heatmap(corr, annot=True, cmap='Greens', linewidths=.4, fmt= '.1f', ax=ax)
plt.show()
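The highly correlated pairs can also be extracted programmatically rather than read off the heatmap; a minimal sketch on a toy frame (the real analysis would apply the same logic to `df1.corr()`):

```python
import numpy as np
import pandas as pd

# Toy frame with one engineered collinear pair (b is an exact multiple of a)
rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({"a": a, "b": 2 * a, "c": rng.normal(size=200)})

corr = toy.corr().abs()
# Keep only the upper triangle so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = [(r, c) for c in upper.columns for r in upper.index
         if upper.loc[r, c] > 0.7]
print(pairs)  # the (a, b) pair exceeds the 0.7 cut-off
```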
# Remove one each from the pair of highly collinear variables
df1 = df1.drop(columns = ['Average_Medicare_Payments', 'Average_Covered_Charges',
'Median_Avg_Total_Pyt', 'Out_of_Pocket_per_discharge',
'Average_Cost_Per_Procedure', 'Median_Score_by_Provider',
'Avg_Payment_by_Median_Income'])
# Final numeric variables selected
num_variables = ['Average_Total_Payments',
'Medicare_%_Paid',
'Medicare_%_Paid_State', 'Out_of_Pocket_Payment',
'Median_Score',
'Total_Disc_by_Pop']
I will use StandardScaler to standardize all the numerical variables
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
X.isnull().sum().sum()
X.shape[0] * 0.80
X_train = X.iloc[:130452,:]
X_test = X.iloc[130452:,:]
print("Shape of new dataframes - {} , {}".format(X_train.shape, X_test.shape))
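As an aside, the split index can be computed from the row count rather than hard-coded, which keeps the 80/20 partition exact if the data change; a small sketch on a hypothetical stand-in frame:

```python
import numpy as np
import pandas as pd

# Stand-in for the scaled feature frame X (10 rows, 2 features)
X_demo = pd.DataFrame(np.arange(20).reshape(10, 2), columns=["f1", "f2"])

split = int(len(X_demo) * 0.80)                # index at 80% of the rows
X_tr, X_te = X_demo.iloc[:split], X_demo.iloc[split:]
assert len(X_tr) + len(X_te) == len(X_demo)    # no row lost or duplicated
```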
An autoencoder is a type of artificial neural network used to learn efficient data codings in an unsupervised manner. The aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for dimensionality reduction, by training the network to ignore signal “noise”.
In order to give you a good sense of what the data look like, I use PCA to reduce to two dimensions and plot accordingly.
pca = PCA(2)
x_pca = pca.fit_transform(X_train)
x_pca = pd.DataFrame(x_pca)
x_pca.columns=['PC1','PC2']
x_pca.head()
# Plot
f,ax = plt.subplots(figsize=(8,5))
plt.scatter(x = x_pca['PC1'], y = x_pca['PC2'], alpha = 0.3)
plt.title('Scatter plot of PC1 and PC2')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
The lighter, isolated points may be outliers or anomalies.
# train the autoencoder detector
clf1 = AutoEncoder(hidden_neurons =[6, 5, 5, 6])
clf1.fit(X_train)
I will score the entire dataset, because even the train dataset contains outliers, and I need to identify outliers in the entire dataset.
y_train_scores = clf1.decision_scores_
y_train_scores
# get the prediction on the test data
y_test_pred = clf1.predict(X) # outlier labels (0 or 1)
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
y_test_scores = clf1.decision_function(X) # outlier scores
y_test_pred = pd.Series(y_test_pred)
y_test_scores = pd.Series(y_test_scores)
y_test_pred.value_counts()
# Plot it!
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores, bins=150)
plt.title("Histogram for Model Autoencoder Clf1, Anomaly Scores, with 150 bins")
plt.xlabel('Score')
plt.show()
y_test_scores_small = y_test_scores[y_test_scores < 10.0]
# y_test_scores_small
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores_small, bins='auto')
plt.title("Histogram for Model Autoencoder Clf1, Anomaly Scores, with auto bins")
plt.xlabel('Score')
plt.show()
A high anomaly score likely means a more abnormal observation. The histogram above shows there are outliers. I will now choose a range of cut points to check the distribution of clusters, and then identify some as suspicious.
Generally, looking at the graph, 6.0 seems the ideal cut point or reasonable boundary. If we choose 6.0 as the cut point, we can suggest those >= 6.0 to be outliers.
But I want to investigate deeper, so I will choose 2 different cut points, which are 6.0 and 10.0.
Now, let's evaluate these 2 different cut points and build a 3-cluster model.
df_test = pd.DataFrame(X)
df_test['score'] = y_test_scores
df_test['cluster'] = (np.where(df_test['score'] > 10.0, 3,
(np.where(df_test['score'] > 6.0, 2, 1))))
df_test['cluster'].value_counts()
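As an alternative to eyeballing the histogram, the cut points could be taken from quantiles of the score distribution, and `np.select` reads a little more cleanly than nested `np.where`; a sketch on synthetic scores (the 98th/99.9th percentiles are illustrative choices, not the cut points used above):

```python
import numpy as np

# Synthetic right-skewed anomaly scores standing in for y_test_scores
scores = np.random.default_rng(1).exponential(scale=1.0, size=10_000)

# Illustrative data-driven cut points instead of eyeballed 6.0 / 10.0
lo, hi = np.quantile(scores, [0.98, 0.999])

cluster = np.select(
    [scores > hi, scores > lo],  # conditions checked in order, first match wins
    [3, 2],
    default=1,                   # everything below lo is the "normal" cluster
)
```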
temp = pd.DataFrame()
temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']
temp
I will provide the description and business insight below, when I aggregate with the 'Average' method
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
X_train = X.iloc[:130452,:]
clf2 = AutoEncoder(hidden_neurons =[6, 5, 2, 5, 6], epochs = 20)
clf2.fit(X_train)
## I chose epochs = 20, as I noticed that the loss stabilizes after the 16th iteration.
y_train_scores = clf2.decision_scores_
y_train_scores
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
y_test_scores = clf2.decision_function(X) # outlier scores
y_test_scores = pd.Series(y_test_scores)
# Plot it!
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores, bins=150)
plt.title("Histogram for Model Autoencoder Clf2, Anomaly Scores, with 150 bins")
plt.xlabel('Score')
plt.show()
y_test_scores_small = y_test_scores[y_test_scores < 10.0]
# y_test_scores_small
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores_small, bins='auto')
plt.title("Histogram for Model Autoencoder Clf2, Anomaly Scores, with auto bins")
plt.xlabel('Score')
plt.show()
df_test = pd.DataFrame(X)
df_test['score'] = y_test_scores
df_test['cluster'] = (np.where(df_test['score'] > 10.0, 3,
(np.where(df_test['score'] > 6.0, 2, 1))))
df_test['cluster'].value_counts()
temp = pd.DataFrame()
temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']
temp
I will provide the description and business insight below, when I aggregate with the 'Average' method
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
X_train = X.iloc[:130452,:]
clf3 = AutoEncoder(hidden_neurons =[6, 5, 3, 2, 3, 5, 6], epochs = 20)
clf3.fit(X_train)
pd.DataFrame.from_dict(clf3.history_).plot(title='Error Loss History');
y_train_scores = clf3.decision_scores_
y_train_scores
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
y_test_scores = clf3.decision_function(X) # outlier scores
y_test_scores = pd.Series(y_test_scores)
# Plot it!
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores, bins=150)
plt.title("Histogram for Model Autoencoder Clf3, Anomaly Scores, with 150 bins")
plt.xlabel('Score')
plt.show()
y_test_scores_small = y_test_scores[y_test_scores < 10.0]
# y_test_scores_small
f,ax = plt.subplots(figsize=(15,5))
plt.hist(y_test_scores_small, bins='auto')
plt.title("Histogram for Model Autoencoder Clf3, Anomaly Scores, with auto bins")
plt.xlabel('Score')
plt.show()
df_test = pd.DataFrame(X)
df_test['score'] = y_test_scores
df_test['cluster'] = (np.where(df_test['score'] > 10.0, 3,
(np.where(df_test['score'] > 6.0, 2, 1))))
df_test['cluster'].value_counts()
temp = pd.DataFrame()
temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']
temp
I will provide the description and business insight below, when I aggregate with the 'Average' method
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
All three models produce the exact same number of data points in each cluster.
Outlier Detection Models Are Sensitive to Outliers/Anomalies
Outlier detection models tend to be very sensitive to outliers. In our case the unsupervised autoencoder model can easily overfit. We need to produce stable models.
The solution is to train multiple models and then aggregate the scores. By aggregating multiple models, the chance of overfitting is greatly reduced and the prediction accuracy improves. The PyOD module offers four methods to aggregate the outcomes. I will use the 'Average' method for my analysis.
new_X = X.drop(columns = ['score', 'cluster'], errors='ignore')  # keep only the feature columns
# Put all the predictions in a data frame
train_scores = pd.DataFrame({'clf1': clf1.decision_scores_,
'clf2': clf2.decision_scores_,
'clf3': clf3.decision_scores_
})
test_scores = pd.DataFrame({'clf1': clf1.decision_function(new_X),
'clf2': clf2.decision_function(new_X),
'clf3': clf3.decision_function(new_X)
})
# Although we did standardization before, it was for the variables.
# Now we do the standardization for the decision scores
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
y_by_average = average(test_scores_norm)
y_by_average[1:10]
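The standardizer/average pair used above amounts to z-scoring each detector's scores with the training scores' mean and standard deviation, then taking the per-row mean; a self-contained numpy sketch of that logic on toy score matrices (PyOD's own helpers do this in the cells above):

```python
import numpy as np

# Toy score matrices: rows = observations, columns = detectors
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[4.0, 40.0], [0.0, 0.0]])

# z-score each detector's column using the training mean/std
mu, sd = train.mean(axis=0), train.std(axis=0)
train_norm = (train - mu) / sd
test_norm = (test - mu) / sd

# "Average" combination: mean of the standardized scores per row
y_avg = test_norm.mean(axis=1)
```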
fig, ax = plt.subplots(figsize= (20,6))
plt.plot(y_by_average);
plt.axhline(y=clf3.threshold_, c='r', ls='dotted', label='threshold');
plt.title('Aggregated Anomaly Scores with automatically calculated threshold (from Clf3), Autoencoder');
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_by_average, bins=200) # arguments are passed to np.histogram
plt.title("Combination by average, for model Autoencoder, with 200 bins")
plt.xlabel('Score')
plt.show()
y_by_average_small = y_by_average[y_by_average < 8.0]
y_by_average_small
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_by_average_small, bins = 'auto') # arguments are passed to np.histogram
plt.title("Histogram for Combination by Average, Autoencoder, Anomaly Scores, with auto bins")
plt.xlabel('Score')
plt.show()
df_test = pd.DataFrame(X)
df_test['score'] = y_by_average
df_test['cluster'] = (np.where(df_test['score'] > 10.0, 3,
(np.where(df_test['score'] > 2.0, 2, 1))))
df_test['cluster'].value_counts()
temp = pd.DataFrame()
temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']
temp
fig = px.bar(temp, x = 'Cluster', y = 'Percentage of total', color = 'Cluster',
width=600, height=400,
title = "Percentage of total in each cluster")
fig.show()
cluster_df = df_test.groupby('cluster').mean().reset_index()
cluster_df
cluster_df.plot(x = 'cluster', y = 'Out_of_Pocket_Payment', color = 'green', kind = 'bar')
cluster_df.plot(x = 'cluster', y ='Total_Disc_by_Pop' , color = 'red', kind ='bar')
cluster_df.plot(x = 'cluster', y ='score' , color = 'blue', kind ='bar')
From the above 3 cluster analysis, the following can be explained:
A. Clusters 2 and 3 have a very small percentage of the total data in their cluster pool. So, there are two possible options for these clusters:
B. Cluster 3 has roughly 0.07%, and Cluster 2 has roughly 2.08%, of the data in its cluster pool. These are the critical cut-off points, as I believe anything near or less than 10% of the entire data represents a different but normal cluster group.
C. The other clusters, which have over 10% of the total data in their cluster pools, seem normal and fine. The differences probably lie in the state or the DRG definition, but they are not likely to be a cause for anomalies.
The above table gives a comprehensive view of the means per cluster, which is useful to identify anomalies.
From the final 6 features used for clustering, I generated 3 different clusters using the aggregated average method. The following are the key takeaways for anomalies and suspicious clusters:
Total Discharges by Population
Average Total Payments
Out of Pocket Payment
Median Score
From the above observations, it is clear that there is 1 distinct cluster which has a very high cluster mean for many, if not all, variables, and emerges as containing suspicious cases or anomalies. This is Cluster 3.
Combining these results with the percentage results obtained in the previous step, I can conclude that Cluster 3 is suspicious and needs further investigation.
I performed the Autoencoder clustering for the data, and also used the 'Average' aggregate method to build a stable model. On building the average model, I found that the original model I built was stable and had good insights.
My analysis gave me 3 clusters, and out of those 3, Cluster 3 seems to be the one with a lot of potential anomalies; it is suspicious because its variable means, as explained above, are far from those of the other clusters.
Further, on evaluating the anomaly score for all the clusters, which gives us insight into which clusters are anomalous, since anomalies tend to have a very high score compared to others, I see that Cluster 3 has a score almost 19 times higher than all the other clusters. So, I can safely conclude that Cluster 3 is highly suspicious.
So, I would pass the 130 specific entries of Cluster 3 on to the relevant authorities, and call for further investigation of each entry to understand whether they are true anomalies. I will provide all the reasoning highlighted above as to the differences in the means, and walk through the process I followed.
I will use StandardScaler to scale all the numerical variables
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
X_train = X.iloc[:130452,:]
# Here, I will take the length of train as the max samples
clf1 = IForest(behaviour="new", max_samples=len(X_train))
clf1.fit(X_train)
I will score the entire dataset, because even the train dataset contains outliers, and I need to identify outliers in the entire dataset.
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
# We apply the model to the test data X to get the outlier scores.
y_test_scores = clf1.decision_function(X) # outlier scores
y_test_scores = pd.Series(y_test_scores)
y_test_scores.head()
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores, bins =150) # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf1, Anomaly Scores, with 150 bins")
plt.xlabel('Distance')
plt.show()
y_test_scores_small = y_test_scores[y_test_scores < 0.2]
#y_test_scores_small
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores_small, bins = 'auto') # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf1, Anomaly Scores, with auto bins")
plt.xlabel('Distance')
plt.show()
A high anomaly score likely means a more abnormal observation. The histogram above shows there are outliers. I will now choose a range of cut points to check the distribution of clusters, and then identify some as suspicious.
Generally, looking at the graph, 0.05 seems the ideal cut point or reasonable boundary. If we choose 0.05 as the cut point, we can suggest those >= 0.05 to be outliers.
But I want to investigate deeper, so I will choose 2 different cut points, which are 0.05 and 0.10.
Now, let's evaluate these 2 different cut points and build a 3-cluster model.
df_test = pd.DataFrame(X)
df_test['distance'] = y_test_scores
df_test['cluster'] = (np.where(df_test['distance'] > 0.10, 3,
(np.where(df_test['distance'] > 0.05, 2, 1))))
df_test['cluster'].value_counts()
temp = pd.DataFrame()
temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']
temp
I will provide the description and business insight below, when I aggregate with the 'Average' method
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
X_train = X.iloc[:130452,:]
# Here, I will take the 80% of the length of train as the max samples
clf2 = IForest(behaviour="new", max_samples=int(round((len(X_train)*0.8),0)))
clf2.fit(X_train)
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
# We apply the model to the test data X to get the outlier scores.
y_test_scores = clf2.decision_function(X) # outlier scores
y_test_scores = pd.Series(y_test_scores)
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores, bins =150) # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf2, Anomaly Scores, with 150 bins")
plt.xlabel('Distance')
plt.show()
y_test_scores_small = y_test_scores[y_test_scores < 0.2]
#y_test_scores_small
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores_small, bins = 'auto') # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf2, Anomaly Scores, with auto bins")
plt.xlabel('Distance')
plt.show()
df_test = pd.DataFrame(X)
df_test['distance'] = y_test_scores
df_test['cluster'] = (np.where(df_test['distance'] > 0.10, 3,
(np.where(df_test['distance'] > 0.05, 2, 1))))
df_test['cluster'].value_counts()
temp = pd.DataFrame()
temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']
temp
I will provide the description and business insight below, when I aggregate with the 'Average' method
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
X = StandardScaler().fit_transform(df1)
X = pd.DataFrame(data = X, columns = num_variables)
X_train = X.iloc[:130452,:]
# Here, I will take the 60% of the length of train as the max samples
clf3 = IForest(behaviour="new", max_samples=int(round((len(X_train)*0.6),0)))
clf3.fit(X_train)
# clf.decision_function: Predict raw anomaly score of X using the fitted detector.
# We apply the model to the test data X to get the outlier scores.
y_test_scores = clf3.decision_function(X) # outlier scores
y_test_scores = pd.Series(y_test_scores)
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores, bins =150) # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf3, Anomaly Scores, with 150 bins")
plt.xlabel('Distance')
plt.show()
y_test_scores_small = y_test_scores[y_test_scores < 0.2]
#y_test_scores_small
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_test_scores_small, bins = 'auto') # arguments are passed to np.histogram
plt.title("Histogram for Model iForest Clf3, Anomaly Scores, with auto bins")
plt.xlabel('Distance')
plt.show()
df_test = pd.DataFrame(X)
df_test['distance'] = y_test_scores
df_test['cluster'] = (np.where(df_test['distance'] > 0.10, 3,
(np.where(df_test['distance'] > 0.05, 2, 1))))
df_test['cluster'].value_counts()
temp = pd.DataFrame()
temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']
temp
I will provide the description and business insight below, when I aggregate with the 'Average' method
# Now let's show the summary statistics:
df_test.groupby('cluster').mean()
Outlier Detection Models Are Sensitive to Outliers/Anomalies
Outlier detection models tend to be very sensitive to outliers. In our case the unsupervised iForest model can easily overfit. We need to produce stable models.
The solution is to train multiple models and then aggregate the scores. By aggregating multiple models, the chance of overfitting is greatly reduced and the prediction accuracy improves. The PyOD module offers four methods to aggregate the outcomes. I will use the 'Average' method for my analysis.
new_X = X.drop(columns = ['distance', 'cluster'], errors='ignore')  # keep only the feature columns
# Put all the predictions in a data frame
train_scores = pd.DataFrame({'clf1': clf1.decision_scores_,
'clf2': clf2.decision_scores_,
'clf3': clf3.decision_scores_
})
test_scores = pd.DataFrame({'clf1': clf1.decision_function(new_X),
'clf2': clf2.decision_function(new_X),
'clf3': clf3.decision_function(new_X)
})
# Although we did standardization before, it was for the variables.
# Now we do the standardization for the decision scores
train_scores_norm, test_scores_norm = standardizer(train_scores,test_scores)
y_by_average = average(test_scores_norm)
y_by_average[1:10]
fig, ax = plt.subplots(figsize= (20,6))
plt.plot(y_by_average);
plt.axhline(y=clf3.threshold_, c='r', ls='dotted', label='threshold');
plt.title('Aggregated Anomaly Distances with automatically calculated threshold (from Clf3), iForest');
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_by_average, bins=200) # arguments are passed to np.histogram
plt.title("Combination by average, for model iForest, with 200 bins")
plt.xlabel('Distances')
plt.show()
y_by_average_small = y_by_average[y_by_average < 8.0]
y_by_average_small
fig, ax = plt.subplots(figsize= (20,6))
plt.hist(y_by_average_small, bins = 'auto') # arguments are passed to np.histogram
plt.title("Combination by average, for model iForest, with auto bins")
plt.xlabel('Distances')
plt.show()
df_test = pd.DataFrame(X)
df_test['distance'] = y_by_average
df_test['cluster'] = (np.where(df_test['distance'] > 6.0, 3,
(np.where(df_test['distance'] > 3.0, 2, 1))))
df_test['cluster'].value_counts()
temp = pd.DataFrame()
temp['Percentage of total'] = (df_test['cluster'].value_counts() / df_test['cluster'].value_counts().sum()) * 100
temp['Cluster'] = ['1','2','3']
temp
fig = px.bar(temp, x = 'Cluster', y = 'Percentage of total', color = 'Cluster',
width=600, height=400,
title = "Percentage of total in each cluster")
fig.show()
cluster_df = df_test.groupby('cluster').mean().reset_index()
cluster_df
cluster_df.plot(x = 'cluster', y = 'Out_of_Pocket_Payment', color = 'green', kind = 'bar')
cluster_df.plot(x = 'cluster', y ='Total_Disc_by_Pop' , color = 'red', kind ='bar')
cluster_df.plot(x = 'cluster', y ='distance' , color = 'blue', kind ='bar')
From the above 3 cluster analysis, the following can be explained:
A. Clusters 2 and 3 have a very small percentage of the total data in their cluster pool. So, there are two possible options for these clusters:
B. Cluster 3 has roughly 0.24%, and Cluster 2 has roughly 1.64%, of the data in its cluster pool. These are the critical cut-off points, as I believe anything near or less than 10% of the entire data represents a different but normal cluster group.
C. The other clusters, which have over 10% of the total data in their cluster pools, seem normal and fine. The differences probably lie in the state or the DRG definition, but they are not likely to be a cause for anomalies.
The above table gives a comprehensive view of the means per cluster, which is useful to identify anomalies.
From the final 6 features used for clustering, I generated 3 different clusters using the aggregated average method. The following are the key takeaways for anomalies and suspicious clusters:
Total Discharges by Population
Average Total Payments
Out of Pocket Payment
Median Score
From the above observations, it is clear that there is 1 distinct cluster which has a very high cluster mean for many, if not all, variables, and emerges as containing suspicious cases or anomalies. This is Cluster 3.
Combining these results with the percentage results obtained in the previous step, I can conclude that Cluster 3 is suspicious and needs further investigation.
I performed the iForest clustering for the data, and also used the 'Average' aggregate method to build a stable model. On building the average model, I found that the original model I built was stable and had good insights.
My analysis gave me 3 clusters, and out of those 3, Cluster 3 seems to be the one with a lot of potential anomalies; it is suspicious because its variable means, as explained above, are far from those of the other clusters.
Further, on evaluating the distance for all the clusters, which gives us insight into which clusters are anomalous, since anomalies tend to have a very high score compared to others, I see that Cluster 3 has a score almost 5-6 times higher than all the other clusters. So, I can safely conclude that Cluster 3 is highly suspicious.
So, I would pass the 395 specific entries of Cluster 3 on to the relevant authorities, and call for further investigation of each entry to understand whether they are true anomalies. I will provide all the reasoning highlighted above as to the differences in the means, and walk through the process I followed.
Autoencoder and iForest are very good models to identify clusters that might be suspicious or contain anomalies. For this analysis, I used 3 clusters as the optimum number of clusters. An important observation was that: